# Reinforcement Learning Training
Thinkless 1.5B RL DeepScaleR
Apache-2.0
Thinkless is a large language model trained via reinforcement learning to choose adaptively between short and long chain-of-thought reasoning, which significantly reduces inference compute cost.
Large Language Model
Transformers

Vinnnf
197
1
Mimo 7B Base
MIT
A series of 7B-parameter language models from Xiaomi built specifically for reasoning, with optimized pre-training and post-training strategies that significantly improve mathematical and code reasoning.
Large Language Model
Transformers

XiaomiMiMo
12.75k
101
Mimo 7B SFT
MIT
MiMo-7B-RL is a model trained with reinforcement learning from the MiMo-7B-SFT model, achieving performance comparable to OpenAI o1-mini on mathematical and code reasoning tasks.
Large Language Model
Transformers

XiaomiMiMo
1,183
23
VL Reasoner 7B
Apache-2.0
VL-Reasoner-7B is a multimodal reasoning model trained with the GRPO-SSR method, with strong results across multiple multimodal reasoning benchmarks.
Text-to-Image
Transformers English

TIGER-Lab
126
1
Timezero ActivityNet 7B
TimeZero is a reasoning-guided large vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks. It uses reinforcement learning to analyze video-language relationships dynamically.
Video-to-Text
Transformers

wwwyyy
142
1
Timezero Charades 7B
TimeZero is a reasoning-guided large vision-language model (LVLM) specifically designed for temporal video grounding (TVG) tasks. It identifies temporal segments in videos corresponding to natural language queries through reinforcement learning methods.
Video-to-Text
Transformers

wwwyyy
183
0
Openchat V2
Other
The OpenChat v2 series is a family of language models based on LLaMA-13B, trained with a conditional weighted loss and reported to surpass ChatGPT on multiple benchmarks.
Large Language Model
Transformers English

openchat
1,090
13
Promptist
Promptist is a reinforcement learning-based automatic prompt optimization tool designed for Stable Diffusion, transforming user input into model-preferred prompts.
Text Generation
Transformers

microsoft
478
66
Dqn SpaceInvadersNoFrameskip V4
This is a reinforcement learning agent based on the DQN algorithm, specifically designed to play SpaceInvadersNoFrameskip-v4, trained using the stable-baselines3 library.
Video Processing
0xrushi
13
0
Dqn Mountaincar V0 Zoo
This is a reinforcement learning agent based on Deep Q-Network (DQN), specifically designed to solve tasks in the MountainCar-v0 environment.
Physics Model
Galeros
16
0
Dqn Mountaincar V0
This is a reinforcement learning agent based on Deep Q-Network (DQN), specifically trained to solve control problems in the MountainCar-v0 environment.
Physics Model
Galeros
18
0
Dqn SpaceInvadersNoFrameskip V4
This is a DQN agent trained using the Stable Baselines3 library, specifically designed to play the SpaceInvadersNoFrameskip-v4 game.
Video Processing
ThomasSimonini
32
1
Ppo BipedalWalker V3
This is a PPO agent trained using the stable-baselines3 library to solve the BipedalWalker-v3 environment.
Protein Model
sb3
22
0
PPO LunarLander V2
This is a reinforcement learning model based on the PPO algorithm, specifically trained for the LunarLander-v2 environment to safely control the lunar lander.
Physics Model
BioGeek
102
0
Dqn LunarLander V2
This is a DQN agent trained using the stable-baselines3 library to solve reinforcement learning tasks in the LunarLander-v2 environment.
araffin
54
2
Ppo Pendulum V1
This is a reinforcement learning model based on the PPO algorithm, specifically designed to solve control problems in the Pendulum-v1 environment.
Physics Model
sb3
51
2
Ppo PongNoFrameskip V4
This is a PPO agent trained using the stable-baselines3 library, specifically designed to play the Atari game PongNoFrameskip-v4.
Video Processing
ThomasSimonini
148
1